perm filename MULTID[4,KMC]4 blob sn#045452 filedate 1973-05-29 generic text, type T, neo UTF8
00100		MULTIDIMENSIONAL EVALUATION OF  A SIMULATION
00200		       OF PARANOID THOUGHT PROCESSES
00300	
00400	               KENNETH MARK COLBY 
00500	                     AND
00600	              FRANKLIN DENNIS HILF
00700	
00800		Once  a  simulation  model  reaches  a  stage  of   intuitive
00900	adequacy,  a  model  builder  should  consider  using  more stringent
01000	evaluation procedures relevant to the model's purposes. For  example,
01100	if the model is to serve as a training device, then a simple
01200	evaluation of its pedagogic effectiveness would be sufficient.    But
01300	when the model is proposed as an explanation of a psychological
01400	process, more is demanded of the evaluation procedure.
01500		We shall first  give  a  brief  description  of  a  model  of
01600	paranoid  processes. A more complete account can be found in Colby,  
01700	Weber, and Hilf [1]. We shall then discuss the evaluation
01800	problem which asks "how good is the model?"  or  "how  close  is  the
01900	correspondence between the behavior of the model and the phenomena
02000	it is intended to explain?"
02100	       (LEE-- INSERT DESCRIPTION OF MODEL HERE)          
02200		Turing's test has often been suggested as a validation procedure.
02300	It is very easy to become confused about Turing's test. In
02400	part this is due to Turing  himself  who  introduced  the  now-famous
02500	imitation   game   in   a  paper  entitled  COMPUTING  MACHINERY  AND
02600	INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
02700	that there are actually two imitation games, the second of which is
02800	commonly called Turing's test.
02900		In the first imitation game  two  groups  of  judges  try  to
03000	determine which of two interviewees is a woman. Communication between
03100	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
03200	informed  that  one  of the interviewees is a woman and one a man who
03300	will pretend to be a woman. After the interview, the judge  is  asked
03400	what we shall call the woman-question, i.e., which interviewee was the
03500	woman?  Turing does not say what else  the  judge  is  told  but  one
03600	assumes  the  judge is NOT told that a computer is involved nor is he
03700	asked to determine which  interviewee  is  human  and  which  is  the
03800	computer.  Thus,  the  first  group  of  judges  would  interview two
03900	interviewees:    a woman, and a man pretending to be a woman.
04000		The  second  group  of judges would be given the same initial
04100	instructions, but unbeknownst to them, the two interviewees would  be
04200	a  woman  and a computer programmed to imitate a woman.   Both groups
04300	of judges  play  this  game  until  sufficient  statistical  data are
04400	collected  to  show  how  often the right identification is made. The
04500	crucial question then is: do the judges decide wrongly AS OFTEN when
04600	the game is played with a man and a woman as when it is played with a
04700	computer substituted for the man? If so, then the program is
04800	considered  to  have  succeeded in imitating a woman as well as a man
04900	imitating a woman. For emphasis we repeat: in asking the
05000	woman-question  in  this  game,  judges  are not required to identify
05100	which interviewee is human and which is machine.
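As a minimal sketch of the comparison described above, and not a procedure that Turing or the present authors specify, the misidentification rates of the two groups of judges could be compared with a pooled two-proportion z statistic. The function name and all counts below are hypothetical, chosen only for illustration.

```python
import math


def two_proportion_z(wrong_a, n_a, wrong_b, n_b):
    """Pooled two-proportion z statistic for the difference in error rates.

    wrong_a / n_a: wrong identifications and judges in the man-woman game.
    wrong_b / n_b: wrong identifications and judges in the computer-woman game.
    """
    p_a = wrong_a / n_a
    p_b = wrong_b / n_b
    pooled = (wrong_a + wrong_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / std_err


# Hypothetical counts: 40 of 100 judges wrong with the man,
# 30 of 100 judges wrong with the computer.
z = two_proportion_z(40, 100, 30, 100)
print("z = %.2f" % z)
```

A value of z near zero would be consistent with the judges deciding wrongly about as often in both versions of the game; the threshold for "near zero" depends on the significance level chosen.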
05200		Later  on  in  his  paper  Turing proposes a variation of the
05300	first game. In the second game one interviewee is a man and one is  a
05400	computer.   The judge is asked to determine which is man and which is
05500	machine, which we shall call the machine-question. It is this version
05600	of  the game which is commonly thought of as Turing's test.    It has
05700	often been suggested as a means of validating computer simulations of
05800	psychological processes.
05900		In  the  course  of  testing a simulation (PARRY) of paranoid
06000	linguistic behavior in a psychiatric interview, we conducted a number
06100	of Turing-like indistinguishability tests (Colby, Hilf, Weber and
06200	Kraemer, 1972). We say `Turing-like' because none of them consisted of
06300	playing  the  two  games  described above. We chose not to play these
06400	games for a number of reasons which can be summarized by saying  that
06500	they  do  not  meet modern criteria for good experimental design.  In
06600	designing our tests we were primarily  interested  in  learning  more
06700	about   developing   the  model.   We  did  not  believe  the  simple
06800	machine-question to be  a  useful  one  in  serving  the  purpose  of
06900	progressively   increasing  the  credibility  of  the  model  but  we
07000	investigated a variation of it to satisfy the curiosity of colleagues
07100	in artificial intelligence.
07200		In this design eight psychiatrists each interviewed two patients
07300	by teletype, using the technique of machine-mediated interviewing,
07400	which involves what we term "non-nonverbal" communication since
07500	non-verbal cues are made impossible (Hilf, 1972). One of each
07600	judge's two patients was PARRY and the other was a hospitalized
07700	paranoid patient. The interviewers were not informed that a
07800	simulation was involved nor were they asked to identify which was the
07900	machine. Their task was to conduct a diagnostic psychiatric interview
08000	and rate each response from the `patients' along a 0-9 scale of
08100	paranoidness, 0 meaning none and 9 the highest. Transcripts of
08200	these interviews, without the ratings of the interviewers, were  then
08300	utilized  for  various  experiments in which randomly selected expert
08400	judges conducted evaluations  of  the  interview  transcripts.    For
08500	example,  in one experiment it was found that patients and model were
08600	indistinguishable along the dimension of paranoidness.
08700		To ask the machine-question, we sent  interview  transcripts,
08800	one  with a patient and one with PARRY, to 100 psychiatrists randomly
08900	selected from the Directory of American Specialists and the Directory
09000	of the American Psychiatric Association. Of the 41 replies, 21 (51%)
09100	made the correct identification while 20 (49%) were wrong. Based on
09200	this random sample of 41 psychiatrists, the 95% confidence interval
09300	for percent correct is 35.9 to 66.5, a range that includes chance. (Our
09400	statistical   consultant   was  Dr.   Helena  C.   Kraemer,  Research
09500	Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
09600	University.)
09700		Psychiatrists   are   considered  expert  judges  of  patient
09800	interview behavior but they are unfamiliar with computers.  Hence  we
09900	conducted  the  same  test  with  100  computer  scientists  randomly
10000	selected from the membership list of the  Association  for  Computing
10100	Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%)
10200	were wrong. Based on this random sample of 67 computer scientists, the
10300	95% confidence interval ranges from 36 to 60, again close to a chance level.
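The intervals quoted for the psychiatrists and the computer scientists can be checked with a short calculation. The sketch below assumes the usual normal-approximation (Wald) interval for a proportion; the text does not say which method was used, but this one agrees with the reported figures. The function name `wald_interval` is illustrative only.

```python
import math


def wald_interval(correct, n, z=1.96):
    """Normal-approximation 95% interval for percent correct (assumed method)."""
    p = correct / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


# Counts taken from the text: 21 of 41 psychiatrists and
# 32 of 67 computer scientists made the correct identification.
psychiatrists = wald_interval(21, 41)
computer_scientists = wald_interval(32, 67)

print("psychiatrists:       %.1f to %.1f" % psychiatrists)
print("computer scientists: %.1f to %.1f" % computer_scientists)
```

Both intervals contain 50, the percentage expected if the judges were guessing.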
10400		Thus the answer to this machine-question, "Can expert judges,
10500	psychiatrists and computer scientists, using teletyped transcripts
10600	of psychiatric interviews, distinguish between paranoid patients and
10700	a simulation of paranoid processes?", is "No". But what do we learn
10800	from this? It is some comfort that the answer was not "Yes" and the
10900	null hypothesis (no differences) failed to be rejected, especially
11000	since statistical tests are somewhat biased in favor of rejecting the
11100	null hypothesis (Meehl, 1967). Yet this answer does not tell us what
11200	we  would  most  like  to  know,  i.e.  how  to  improve  the  model.
11300	Simulation  models  do  not  spring  forth in a complete, perfect and
11400	final form; they must be gradually developed over time. Perhaps we
11500	might obtain a "yes" answer to the machine-question if we allowed a
11600	large number of expert judges to conduct the interviews themselves
11700	rather than studying transcripts of others' interviews. Such a result would
11800	indicate that the model must be improved but unless we systematically
11900	investigated how the judges succeeded in making the discrimination we
12000	would not know what aspects of the model to work on. The logistics of
12100	such a design are immense and obtaining a large N of judges for sound
12200	statistical inference would require an effort disproportionate to the
12300	information-yield.
12400		A more efficient and informative way to use Turing-like tests
12500	is to ask judges to make ordinal ratings along scaled dimensions from
12600	teletyped  interviews.     We  shall  term  this  approach asking the
12700	dimension-question.   One can then compare scaled ratings received by
12800	the patients and by the model to precisely determine where and by how
12900	much they differ.        Model builders  strive  for  a  model  which
13000	shows     indistinguishability     along    some    dimensions    and
13100	distinguishability along others.  That is, the model converges on what
13200	it is supposed to simulate and diverges from that which it is not.
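A minimal sketch of the dimension-question comparison follows. It assumes that ratings on the 0-9 scale are collected per dimension for the model and for the patients, and that mean ratings are compared with Welch's t statistic; the choice of Welch's form is an assumption, and every rating below is hypothetical, invented only to show the shape of the calculation.

```python
import math
from statistics import mean, variance


def welch_t(model_ratings, patient_ratings):
    """Welch's t statistic for the difference between two mean ratings."""
    std_err = math.sqrt(
        variance(model_ratings) / len(model_ratings)
        + variance(patient_ratings) / len(patient_ratings)
    )
    return (mean(model_ratings) - mean(patient_ratings)) / std_err


# Hypothetical 0-9 ratings; these are NOT data from the study.
ratings = {
    "mistrust": {"model": [7, 8, 6, 7, 8], "patient": [5, 6, 5, 4, 6]},
    "fear": {"model": [3, 4, 2, 3, 4], "patient": [3, 3, 4, 2, 4]},
}

for dimension, groups in ratings.items():
    t = welch_t(groups["model"], groups["patient"])
    print("%-10s t = %.2f" % (dimension, t))
```

A large positive t on a dimension would mean the model is rated higher than the patients there, a large negative t the reverse, and a value near zero indistinguishability on that dimension.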
13300		We  mailed  paired-interview  transcripts  to   another   400
13400	randomly  selected psychiatrists asking them to rate the responses of
13500	the two `patients' along certain dimensions. The judges were  divided
13600	into groups, each judge being asked to rate responses of each input-output
13700	(I-O) pair in the interviews along four dimensions. The total number of
13800	dimensions in this test was twelve: linguistic noncomprehension,
13900	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
14000	ideas  of  reference, delusions, mistrust, depression, suspiciousness
14100	and mania. These are dimensions which psychiatrists commonly  use  in
14200	evaluating patients.
14300		Table 1 shows there were significant differences, with  PARRY
14400	receiving   higher   scores   along   the  dimensions  of  linguistic
14500	noncomprehension, thought disorder, bizarreness, anger, mistrust and
14600	suspiciousness. On the dimension of delusions the patients were rated
14700	significantly higher. There were no significant differences along the
14800	dimensions of organic brain syndrome, fear, ideas of reference,
14900	depression and mania.
15000		While    tests    asking    the   machine-question   indicate
15100	indistinguishability at  the  gross  level,  a  study  of  the  finer
15200	structure of the model's behavior through ratings along scaled
15300	dimensions  shows  statistically  significant   differences   between
15400	patients  and  model.     These  differences are of help to the model
15500	builder in suggesting which aspects of the model must be modified and
15600	improved  in  order  to  be  considered an adequate simulation of the
15700	class of paranoid patients it is intended to simulate.  For  example,
15800	it is clear that PARRY's language-comprehension must be improved.
15900	Once this has been implemented, a future test will  tell  us  whether
16000	improvement has occurred and by how much in comparison to the earlier
16100	version.   Successive identification of particular areas  of  failure
16200	in the model permits their improvement and the development of
16300	more adequate model-versions.
16400		Further evidence that the machine-question is too coarse  and
16500	insensitive  a test comes from the following experiment. In this test
16600	we constructed a random version of the paranoid model which  utilized
16700	PARRY's output statements but expressed them randomly no matter what
16800	the interviewer said.   Two psychiatrists conducted  interviews  with
16900	this  model, transcripts of which were paired with patient interviews
17000	and sent to 200  randomly  selected  psychiatrists  asking  both  the
17100	machine-question  and  the dimension-question.  Of the 69 replies, 34
17200	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
17300	69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
17400	again indicating a chance level. However, as shown in Table 2,
17500	significant  differences  appear  along  the dimensions of linguistic
17600	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
17700	rated  higher.  On  these  particular  dimensions  we can construct a
17800	continuum in which the random version  represents  one  extreme,  the
17900	actual patients another. Our (nonrandom) PARRY lies somewhere between
18000	these two extremes, indicating that it performs significantly  better
18100	than  the  random version but still requires improvement before being
18200	indistinguishable from patients (see Fig. 1). Table 3 presents t
18300	values for differences between mean ratings of PARRY and
18400	RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
18500	The fact that even a random model can pass the machine-question test
18600	shows, not that the model is a good simulation, but that the test
18700	is  weak and nonchallenging.
18800		Thus it can be seen that such a multidimensional evaluation
18900	provides  yardsticks  for measuring the adequacy of this or any other
19000	dialogue simulation model along the relevant dimensions.
19100		We conclude that when model builders want  to  conduct  tests
19200	which  indicate  in  which  direction  progress  lies and to obtain a
19300	measure of whether  progress  is  being  achieved,  the  way  to  use
19400	Turing-like  tests  is  to  ask  expert  judges to make ratings along
19500	multiple dimensions that are essential to the model.  Useful tests do
19600	not prove a model; they probe it for its strengths and weaknesses.
19700	Simply asking the machine-question yields little information relevant
19800	to what the model builder most wants  to  know,  namely,  along  what
19900	dimensions must the model be improved.
20000	
20100	
20200			REFERENCES
20300	
20400	[1]  Colby, K.M., Weber, S. and Hilf, F.D., 1971. Artificial paranoia.
20500		ARTIFICIAL INTELLIGENCE, 2, 1-25.
20600	
20700	[2]  Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C., 1972. Turing-like
20800		indistinguishability tests for the validation of a computer
20900		simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3,
21000		199-221.
21100	
21200	[3]  Hilf, F.D., 1972. Non-nonverbal communication and psychiatric research.
21300		ARCHIVES OF GENERAL PSYCHIATRY, 27, 631-635.
21400	
21500	[4]  Meehl, P.E., 1967. Theory testing in psychology and physics: a
21600		methodological paradox. PHILOSOPHY OF SCIENCE, 34, 103-115.
21700	
21800	[5]  Turing, A., 1950. Computing machinery and intelligence. Reprinted in:
21900		COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
22000		McGraw-Hill, New York, 1963, pp. 11-35.
22100	
22200	
22300			ACKNOWLEDGEMENTS
22400	
22500	This research is supported by Grant PHS MH 06645-12 from the National
22600	Institute of Mental Health and (in part) by Research Scientist Award
22700	(No. 1-K05-K-14,433) from the National Institute of Mental Health to
22800	the senior author.